A Latent Semantic Structure Model for Text Classification
Abstract
Latent Semantic Indexing (LSI) has been successfully applied to information retrieval and classification. LSI can handle polysemy and synonymy, and can reduce noise in the raw document-term matrix. However, LSI may ignore features that are important for some small categories because they are not among the most important features for the document collection as a whole. In this paper, we describe a new approach that extends LSI by also incorporating the class information of the training documents. Our model considers two matrices: document-term and document-class. This model may capture the latent semantic structure behind the classification examples better than LSI does.
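As a rough illustration of the two matrices mentioned in the abstract, the sketch below computes a plain LSI embedding from a toy document-term matrix using a truncated SVD, then a second embedding from the document-term and document-class matrices placed side by side. The side-by-side combination, the `weight` parameter, the function names, and the toy data are all assumptions made for illustration; the abstract does not say how the paper combines the two matrices, so this should not be read as the authors' method.

```python
import numpy as np


def lsi_embed(X, k):
    """Plain LSI: represent each row (document) of X by its coordinates
    along the top-k singular directions of X."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]


def joint_embed(X, Y, k, weight=1.0):
    """Hypothetical combination (not taken from the paper): run the same
    truncated SVD on [X | weight * Y], so that class membership also
    shapes the latent space."""
    Z = np.hstack([X, weight * Y])
    return lsi_embed(Z, k)


# Toy data (assumed): 4 training documents, 5 terms, 2 classes.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Document-class matrix: one row per document, one column per class.
Y = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
], dtype=float)

docs_lsi = lsi_embed(X, k=2)
docs_joint = joint_embed(X, Y, k=2, weight=2.0)

# Distance between two documents of the same class, and between two
# documents of different classes, in the joint latent space.
same_class = np.linalg.norm(docs_joint[0] - docs_joint[1])
diff_class = np.linalg.norm(docs_joint[0] - docs_joint[2])
```

Class labels exist only for training documents, so this sketch covers the training-side embedding only and says nothing about how unseen documents would be projected.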
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Latent Dirichlet Allocation
We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where t...
Chapter 2 Text Representation and Classification Methods
Text representation and classification methods are the most important research objectives of text classification. Text representation is a prerequisite of text classification, mainly because it determines how text is encoded, which directly affects classification performance. In this thesis, we have used a statistical topic model for the purpose of reducing dimensionality and simultaneously representing ...
Influence of domain information on Latent Semantic Analysis of Hindi text
This paper evaluates the performance of the Latent Semantic Analysis (LSA) model in capturing word correlations within text when domain information is included in the process. The performance of the model is empirically evaluated by classification of Hindi text. The classification accuracies are compared against plain LSA. An increase of 1.25% classification accuracy is ac...
Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the languag...